Abstract

VT (Viterbi training), or hard EM, is an efficient way of parameter learning for
probabilistic models with hidden variables. Given an observation $y$, it searches for a
state of hidden variables $x$ that maximizes $p(x, y \mid \theta)$ by coordinate ascent on
parameters $\theta$ and $x$. In this paper we introduce VT to PRISM, a logic-based
probabilistic modeling system for generative models. VT improves PRISM in three ways.
First, VT in PRISM converges faster than EM in PRISM due to VT's termination condition.
Second, parameters learned by VT often show good prediction performance compared to those
learned by EM. We conducted two parsing experiments with probabilistic grammars while
learning parameters by a variety of inference methods, i.e. VT, EM, MAP and VB. The result
is that VT achieved the best parsing accuracy among them in both experiments. We also
conducted a similar experiment for classification tasks where, unlike probabilistic
grammars, a hidden variable is not a prediction target. We found that in such a case VT
does not necessarily yield superior performance. Third, since VT always deals with a single
probability of a single explanation, the Viterbi explanation, the exclusiveness condition
that is imposed on PRISM programs is no longer required if we learn parameters by VT.

Last but not least, as VT in PRISM is general and applicable to any PRISM program, it
largely reduces the need for the user to develop a specific VT algorithm for a specific
model. Furthermore, since VT in PRISM can be used just by setting a PRISM flag
appropriately, it makes VT easily accessible to (probabilistic) logic programmers.

KEYWORDS: Viterbi training, PRISM, exclusiveness condition 



1 Introduction 

VT (Viterbi training) has long been used as an efficient parameter learning method in
various research fields such as machine translation (Brown et al. 1993), speech
recognition (Juang and Rabiner 1990; Strom et al. 1999), image analysis (Joshi et al. 2006),
parsing (Spitkovsky et al. 2010) and gene finding (Lomsadze et al. 2005). Although VT is
NP-hard even for PCFGs (probabilistic context free grammars), which is proved by encoding
the 3-SAT problem into PCFGs (Cohen and Smith 2010), and is biased unlike MLE (maximum
likelihood estimation) (Lember and Koloydenko 2007), it often outperforms and runs faster
than the conventional EM algorithm.






T. Sato and K. Kubota



We introduce this VT to PRISM^1, which is a probabilistic extension of Prolog
(Sato and Kameya 2001; Sato and Kameya 2008). There are already multiple parameter
learning methods available in PRISM. One is the EM algorithm, or more generally MAP
(maximum a posteriori) estimation (Sato and Kameya 2001). Another is VB (variational
Bayes) (Sato et al. 2009), which approximately realizes Bayesian inference and learns
pseudo counts assuming Dirichlet priors over parameters. They are implemented on PRISM's
data structure called explanation graphs, which represent AND/OR boolean formulas made up
of probabilistic ground atoms. Probabilities used in EM, MAP and VB are all computed by
running the generalized inside-outside algorithm (Sato and Kameya 2001) or its variant on
explanation graphs.

VT in PRISM runs on explanation graphs just like EM, MAP and VB but always deals with a
single probability of a single explanation called the Viterbi explanation or most probable
explanation. Compared to EM, which updates only parameters, VT alternately updates the
Viterbi explanation and parameters, computing one from the other and vice versa, until the
Viterbi explanation stops changing. Note that this results in earlier termination of the
algorithm than EM because a small perturbation in parameters does not change the Viterbi
explanation whereas it keeps EM running. In fact, we found in our experiments in Section 3
that EM required 8 to 15 times more iterations to stop than VT. Also, since VT updates
parameters so that they maximize the probability of the Viterbi explanation, it is
possible and probable that the final parameters by VT give a higher probability to the
Viterbi explanation than those learned by EM, which intuitively explains why VT tends to
yield superior performance to EM in prediction tasks such as parsing that compute the
Viterbi explanation as a predicted value, as we see in Section 4.

In addition VT brings about a favorable side effect on PRISM. VT does not require the
exclusiveness condition which is imposed on PRISM programs to ensure efficient
sum-product probability computation. This is because VT always deals with a single
probability of the Viterbi explanation and hence there is no need for summing up
probabilities of the non-exclusive explanations. Consequently PRISM can learn parameters
by VT for programs that do not satisfy the exclusiveness condition. We will discuss the
exclusiveness condition further in Section 6.

VT thus improves PRISM in the following points: 

• Faster convergence due to fewer iterations than EM

• Ability to learn parameters good for prediction 

• The elimination of the exclusiveness condition imposed on programs 

From the viewpoint of statistical machine learning and PLP (probabilistic logic
programming), on the other hand, we can first say that PRISM generalizes VT. That is, the
VT algorithm implemented in PRISM works for arbitrary probabilistic models described by
PRISM, a Turing complete language, including BNs (Bayesian networks), HMMs (hidden Markov
models) and PCFGs, and hence eliminates the need for the user to derive and implement a
specific VT algorithm for a specific model that can be described as a PRISM program. Also
it makes VT easily accessible to probabilistic logic programmers because they can use VT
just by setting learn_mode, one of PRISM's flags, appropriately. As a result, by switching
the learn_mode flag users can choose the best parameter learning method for their models
from EM, MAP, VB and VT, all available in PRISM2.1, without rewriting and adapting their
programs to each parameter learning method. Indeed, the exhaustive comparisons among EM,
MAP, VB and VT done in our experiments seem quite costly in other environments.

^1 VT is available in PRISM2.1. PRISM2.1 is the latest version of PRISM, downloadable from
http://sato-www.cs.titech.ac.jp/prism/

Viterbi training in PRISM

In what follows, we first review PRISM in Section 2 and then explain the basic idea of VT
and reformulate it for PRISM in Section 3. We then apply VT to two probabilistic grammars
in Section 4 using the ATR corpus where a hidden variable in a model is a prediction
target. In Section 5 we deal with a different situation using an NBH (naive Bayes with a
hidden variable) model whose hidden variable is not a prediction target. We explain the
implication of VT on the exclusiveness condition in Section 6. Section 7 discusses related
work. Section 8 is the conclusion.



2 Reviewing PRISM

For self-containedness we review PRISM focusing on its computation mechanism. PRISM is one
of the SRL (statistical relational learning) / PLL (probabilistic logic learning) languages
(Getoor and Taskar 2007; De Raedt and Kersting 2008) which aim at using rich expressions
such as relations and first-order logic for complex probabilistic modeling. It is a
probabilistic extension of Prolog enhanced with various built-in predicates for statistical
machine learning, such as predicates for parameter learning, Viterbi inference, model
scoring, MCMC sampling and so on, in addition to the standard predicates of Prolog.



2.1 Probability naively computed 

Syntactically a PRISM program DB looks like a usual Prolog program except for the use of a
probabilistic built-in predicate of the form msw(i,v), called a "multi-valued random
switch" (with switch name i), that represents a probabilistic choice using simple
probabilistic events such as dice throwing; msw(i,v) says that throwing a die named i
yields an outcome v. Let $V_i = \{v_1, \ldots, v_{|V_i|}\}$ be the set of possible outcomes
for i. The set $\mathrm{msw}(i,\cdot) \stackrel{\mathrm{def}}{=} \{\mathrm{msw}(i,v) \mid v \in V_i\}$
of msw atoms is given a joint distribution such that one of the msw(i,·)'s, say msw(i,v),
becomes exclusively true (others false) with probability $\theta_{i,v}$ ($v \in V_i$) where
$\sum_{v \in V_i} \theta_{i,v} = 1$. In other words, msw(i,·) stands for a discrete random
variable $X_i$ taking v with probability $\theta_{i,v}$ ($v \in V_i$). In this sense we
identify msw(i,·) with $X_i$ and its distribution.

The $\theta_{i,v}$'s are called parameters associated with i. They are directly specified
by the user or learned from data. We define $P_{msw}(\cdot \mid \theta)$ as an infinite
product of such distributions for msws where $\theta$ stands for the set of all parameters.
Then $P_{msw}(\cdot \mid \theta)$ is uniquely extended by way of the least model semantics
for logic programs to a $\sigma$-additive probability measure $P_{DB}(\cdot \mid \theta)$
over possible Herbrand interpretations of DB which we consider as the denotation of DB
(distribution semantics) (Sato 1995; Sato and Kameya 2001). In the following we omit
$\theta$ when the context is clear for the sake of brevity.

Let G be a non-msw atom which is ground. $P_{DB}(G)$, the probability of G, can be naively
computed as follows. First reduce the top-goal G using Prolog's exhaustive top-down proof
search to an equivalent^2 propositional DNF formula
$\mathrm{expl}_0(G) = e_1 \vee \cdots \vee e_k$^3 where $e_i$ ($1 \le i \le k$) is a
conjunction $\mathrm{msw}_1 \wedge \cdots \wedge \mathrm{msw}_n$ of msw atoms such that
$\mathrm{msw}_1 \wedge \cdots \wedge \mathrm{msw}_n, DB \vdash G$. Each $e_i$ is called an
explanation for G. Then assuming

[Independence condition] msw atoms in an explanation are independent:

\[ P_{DB}(\mathrm{msw} \wedge \mathrm{msw}') = P_{DB}(\mathrm{msw})\, P_{DB}(\mathrm{msw}') \]

[Exclusiveness condition] Explanations are exclusive:

\[ P_{DB}(e_i \wedge e_j) = 0 \quad \text{if } i \ne j \]

we compute $P_{DB}(G)$ as

\[ P_{DB}(G) = P_{DB}(e_1) + \cdots + P_{DB}(e_k) \]
\[ P_{DB}(e_i) = P_{DB}(\mathrm{msw}_1) \cdots P_{DB}(\mathrm{msw}_n)
   \quad \text{for } e_i = \mathrm{msw}_1 \wedge \cdots \wedge \mathrm{msw}_n. \]

Recall here that msws with different switch names are independent by construction of
$P_{msw}(\cdot \mid \theta)$. We may further assume that msw atoms with the same switch
name are iid (independent and identically distributed). That is, when msw(i,v) and
msw(i,v') occur in a program, we consider them to be the results of sampling the same
msw(i,·) twice. This is justified by hypothetically adding an implicit argument, trial-id t
(Sato and Kameya 2001), to msw(i,·) and assuming that the msw(i,t,·)s have a product of
joint distributions just like the case of msw/2, which makes msw(i,t,·) and msw(i,t',·)
($t \ne t'$) iid. So in what follows we assume the independence condition is automatically
satisfied.

In contrast, the exclusiveness condition cannot be automatically satisfied. It needs to be
satisfied by the user, for example, by writing a program so that it generates an output
solely as a sequence of probabilistic choices made by msw atoms (modulo auxiliary
non-probabilistic computation). Although most generative models including BNs, HMMs and
PCFGs are naturally written this way, there are models which are unnatural or difficult to
write this way (De Raedt et al. 2007). Relating to this, observe that the Viterbi
explanation, i.e. the most likely explanation $e^*$ for G, is computed similarly to
$P_{DB}(G)$ just by replacing sum with argmax:
$e^* \stackrel{\mathrm{def}}{=} \mathrm{argmax}_{e \in \mathrm{expl}_0(G)} P_{DB}(e)$, and
does not require the exclusiveness condition to compute because it only deals with the
probability of a single explanation. We will discuss the exclusiveness condition further in
Section 6.
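The naive computation described in this subsection can be sketched in Python. This is only
an illustration under stated assumptions, not PRISM code: the switch names, parameter
values and explanations below are hypothetical, and each explanation is represented as a
list of (switch, value) pairs that are assumed to satisfy the independence and
exclusiveness conditions.

```python
from math import prod

# Hypothetical parameters theta[(switch, value)]; not taken from the paper.
theta = {("coin", "h"): 0.6, ("coin", "t"): 0.4,
         ("die", 1): 0.5, ("die", 2): 0.5}

# Hypothetical explanations for a goal G, each a conjunction of msw atoms.
explanations = [
    [("coin", "h"), ("die", 1)],
    [("coin", "t"), ("die", 2), ("coin", "t")],
]

def prob_explanation(e, theta):
    """P(e): product of the parameters of the msw atoms in e (independence)."""
    return prod(theta[atom] for atom in e)

def prob_goal(expls, theta):
    """P(G): sum over explanations (valid only if they are exclusive)."""
    return sum(prob_explanation(e, theta) for e in expls)

def viterbi_explanation(expls, theta):
    """Most probable explanation: the sum is replaced with an argmax."""
    return max(expls, key=lambda e: prob_explanation(e, theta))

p_goal = prob_goal(explanations, theta)
best = viterbi_explanation(explanations, theta)
```

Note that `viterbi_explanation` never adds probabilities of different explanations, which
is why it does not rely on exclusiveness.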



^2 The equivalence means that G and $\mathrm{expl}_0(G)$ denote the same Boolean random
variable in view of the distribution semantics of PRISM.
^3 When convenient, we treat $\mathrm{expl}_0(G)$ as a bag $\{e_1, \ldots, e_k\}$ of
explanations.
2.2 Tabled search, dynamic programming, probability computation and Viterbi inference

So far our computation is naive. Since there can be exponentially many explanations, naive
computation would lead to exponential time computation. PRISM avoids this by adopting
tabled search in the exhaustive search for all explanations for the top-goal G and applying
dynamic programming to probability computation. By tabling, a goal once called and proved
is stored (tabled) in memory with answer substitutions and later calls to the goal return
with stored answer substitutions without processing further. Tabling is important to
probability computation because tabled goals factor out common sub-conjunctions in
$\mathrm{expl}_0(G)$, which results in sharing probability computation for the common
sub-conjunctions, thereby realizing dynamic programming which gives exponentially faster
probability computation compared to naive computation.

As a result of exhaustive tabled search for all explanations for G, PRISM obtains a set of
propositional formulas called defining formulas of the form
$H \Leftrightarrow B_1 \vee \cdots \vee B_h$ for every tabled goal^4 that directly or
indirectly calls msws. We call the heads of defining formulas defined goals. Each $B_i$
($1 \le i \le h$) is recursively a conjunction
$C_1 \wedge \cdots \wedge C_m \wedge \mathrm{msw}_1 \wedge \cdots \wedge \mathrm{msw}_n$
($0 \le m, n$) of defined goals $\{C_1, \ldots, C_m\}$ and msw atoms
$\{\mathrm{msw}_1, \ldots, \mathrm{msw}_n\}$. We introduce a binary relation $H \succ C$
over defined goals such that $H \succ C$ holds if H is the head of some defining formula
and C occurs in the body. Assuming $\succ$ is acyclic, we extend it to a partial ordering
over the defined goals. We denote by expl(G) the whole set of defining formulas and call
expl(G) the explanation graph for G like the non-tabled case.

Once expl(G) is obtained, since defined goals are layered by the $\succ$ relation by our
assumption, where a defining formula in the bottom layer has only msws in the body whose
probabilities are known, we can compute probabilities by the sum-product operation^5 for
all defined goals from the bottom layer upward in a dynamic programming manner in time
linear in the size of expl(G), i.e. the number of atoms appearing in expl(G).

Compared to naive computation, dynamic programming on expl(G) can reduce time complexity
for probability computation from exponential time to polynomial time. For example PRISM's
probability computation for HMMs takes $O(L)$ time for a given sequence with length L and
coincides with the well-known forward-backward algorithm for HMMs. Likewise PRISM's
probability computation for PCFGs takes $O(L^3)$ time for a sentence with length L and
coincides with the computation of inside probability for PCFGs. More interestingly, BP
(belief propagation), one of the standard algorithms for probability computation for BNs,
coincides with PRISM's probability computation applied to PRISM programs that describe
junction trees (Sato 2007).
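The bottom-up sum-product computation on an explanation graph can be sketched as follows.
This is a minimal illustration, not PRISM's implementation: the graph, the goal names "g"
and "h", and the parameter values are hypothetical, each defined goal maps to a list of
bodies, and each body is a pair (defined subgoals, msw atoms). The graph is assumed to be
acyclic and to satisfy the exclusiveness and independence conditions.

```python
# Hypothetical explanation graph: goal -> list of (subgoals, msw atoms).
graph = {
    "g": [(["h"], [("a", 0)]), (["h"], [("a", 1)])],
    "h": [([], [("b", 0)]), ([], [("b", 1), ("c", 0)])],
}
theta = {("a", 0): 0.3, ("a", 1): 0.7,
         ("b", 0): 0.4, ("b", 1): 0.6,
         ("c", 0): 0.5, ("c", 1): 0.5}

def inside_probabilities(graph, theta):
    """Sum-product over an acyclic explanation graph, memoising each goal."""
    memo = {}

    def prob(goal):
        if goal not in memo:
            total = 0.0
            for subgoals, msws in graph[goal]:
                term = 1.0
                for c in subgoals:
                    term *= prob(c)      # shared sub-result, computed once
                for m in msws:
                    term *= theta[m]
                total += term            # sum over exclusive bodies
            memo[goal] = total
        return memo[goal]

    for goal in graph:
        prob(goal)
    return memo

probs = inside_probabilities(graph, theta)
```

Because the probability of "h" is stored after its first computation, both bodies of "g"
reuse it, so the work grows with the number of atoms in the graph rather than with the
number of explanations.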

^4 The top-goal G is a tabled goal. Tabled goals except the top-goal are called
"intermediate goals" in Sato and Kameya (2001) and Zhou et al. (2008).
^5 The exclusiveness and independence conditions are inherited from the naive case.

Viterbi inference, i.e. the computation of the Viterbi explanation and its probability, is
similarly performed on expl(G) in a bottom-up manner like probability computation stated
above. The only difference is that we use argmax instead of sum. In what follows we look
into how the Viterbi explanation is computed. We use $\theta$ for the set of all
parameters. Let H be a defined goal and $H \Leftrightarrow B_1 \vee \cdots \vee B_h$ the
defining formula for H in expl(G). Write
$B_i = C_1 \wedge \cdots \wedge C_m \wedge \mathrm{msw}(i_1,v_1) \wedge \cdots \wedge \mathrm{msw}(i_n,v_n)$
($1 \le i \le h$) and suppose recursively that the Viterbi explanation $e^*_{C_j}$
($1 \le j \le m$) has already been calculated for each defined goal $C_j$ in $B_i$. Then
the Viterbi explanation $e^*_{B_i}$ for $B_i$ and the Viterbi explanation $e^*_H$ for H
are respectively computed by

\[ e^*_{B_i} = e^*_{C_1} \wedge \cdots \wedge e^*_{C_m} \wedge
   \mathrm{msw}(i_1,v_1) \wedge \cdots \wedge \mathrm{msw}(i_n,v_n) \]
\[ e^*_H = \mathrm{argmax}_{e^*_{B_i},\, 1 \le i \le h}\; P_{DB}(e^*_{B_i} \mid \theta) \]
\[ \text{where } P_{DB}(e^*_{B_i} \mid \theta) =
   P_{DB}(e^*_{C_1} \mid \theta) \cdots P_{DB}(e^*_{C_m} \mid \theta)\,
   \theta_{i_1,v_1} \cdots \theta_{i_n,v_n}. \]

Here $\theta_{i_1,v_1}$ is a parameter associated with msw($i_1$,$v_1$) and so on. In this
way the Viterbi explanation for the top-goal G is computed in a bottom-up manner by
scanning expl(G) once in time linear in the size of expl(G), i.e. with exactly the same
time complexity as probability computation; for example $O(L)$ for HMMs and $O(L^3)$ for
PCFGs where L is respectively the length of the sequence and that of the sentence.
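The Viterbi recursion on an explanation graph can be sketched in the same style. Again this
is an illustration under assumptions rather than PRISM's code: the graph and parameters are
hypothetical, each body is a pair (defined subgoals, msw atoms), and ties in the argmax are
broken by taking the first maximal body.

```python
# Hypothetical explanation graph: goal -> list of (subgoals, msw atoms).
graph = {
    "g": [(["h"], [("a", 0)]), (["h"], [("a", 1)])],
    "h": [([], [("b", 0)]), ([], [("b", 1), ("c", 0)])],
}
theta = {("a", 0): 0.3, ("a", 1): 0.7,
         ("b", 0): 0.4, ("b", 1): 0.6,
         ("c", 0): 0.5, ("c", 1): 0.5}

def viterbi(graph, theta, top):
    """Return (probability, msw atoms) of the Viterbi explanation for `top`."""
    memo = {}

    def best(goal):
        if goal not in memo:
            candidates = []
            for subgoals, msws in graph[goal]:
                p, expl = 1.0, []
                for c in subgoals:
                    pc, ec = best(c)     # Viterbi explanation of the subgoal
                    p *= pc
                    expl += ec
                for m in msws:
                    p *= theta[m]
                    expl.append(m)
                candidates.append((p, expl))
            # argmax over the bodies B_1, ..., B_h instead of a sum
            memo[goal] = max(candidates, key=lambda pe: pe[0])
        return memo[goal]

    return best(top)

p_star, e_star = viterbi(graph, theta, "g")
```

Each defined goal is visited once, so the scan is linear in the size of the graph, in line
with the complexity statement above.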

Parameter learning in PRISM, be it EM, MAP, VB or VT (explained next), is based on
computation by dynamic programming on expl(G). For example EM in PRISM computes generalized
inside probabilities and generalized outside probabilities for defined goals in expl(G)
using dynamic programming and calculates expectations of the number of occurrences of msw
atoms in an SLD proof for the top-goal to update parameters in each iteration, similarly to
the Inside-Outside algorithm for PCFGs (Sato and Kameya 2001). MAP (maximum a posteriori)
estimation and VB (variational Bayes) inference are also performed similarly
(Sato et al. 2009; Sato and Kameya 2008).

3 Viterbi training and PRISM 

In this section we adapt VT to the distribution semantics of PRISM and derive the VT
algorithm for PRISM.

3.1 Viterbi training 

Here we explain the basic idea of VT without assuming specific distributions. Let x be
hidden variables, y observed ones and $p(x, y \mid \theta)$ their joint distribution with
parameters $\theta$. We assume x and y are discrete. MLE estimates parameters from y as the
maximizer of the (log) likelihood function $L_{EM}(y \mid \theta)$:

\[ L_{EM}(y \mid \theta) \stackrel{\mathrm{def}}{=} \log \sum_x p(x, y \mid \theta). \]

In the case of MAP (maximum a posteriori) estimation, we add a prior distribution
$p(\theta)$ and use $L_{MAP}(y \mid \theta)$ below as an objective function:

\[ L_{MAP}(y \mid \theta) \stackrel{\mathrm{def}}{=} \log \sum_x p(x, y \mid \theta)\, p(\theta). \]
What VT does is similar to MLE and MAP but it uses a different objective function
$L_{VT}(y \mid \theta)$ defined as

\[ L_{VT}(y \mid \theta) \stackrel{\mathrm{def}}{=} \log \max_x p(x, y \mid \theta)\, p(\theta). \]

VT estimates parameters as the maximizer of $L_{VT}(y \mid \theta)$ by coordinate ascent
that alternates the maximization of $\log p(x, y \mid \theta)$ w.r.t. x and the
maximization of $\log p(x, y \mid \theta)\, p(\theta)$ w.r.t. $\theta$:

\[ x^{(n)} = \mathrm{argmax}_x \log p(x, y \mid \theta^{(n)}) \qquad (1) \]
\[ \theta^{(n+1)} = \mathrm{argmax}_\theta \log p(x^{(n)}, y \mid \theta)\, p(\theta) \qquad (2) \]

Starting with appropriate initial parameters $\theta^{(0)}$, VT iterates the above two
steps and terminates when $x^{(n+1)} = x^{(n)}$ holds (recall that random variables x and
y are discrete). Proving the convergence property of VT is straightforward.

\[ L_{VT}(y \mid \theta^{(n+1)}) = \log p(x^{(n+1)}, y \mid \theta^{(n+1)})\, p(\theta^{(n+1)}) \]
\[ \ge \log p(x^{(n)}, y \mid \theta^{(n+1)})\, p(\theta^{(n+1)}) \]
\[ \ge \log p(x^{(n)}, y \mid \theta^{(n)})\, p(\theta^{(n)}) = L_{VT}(y \mid \theta^{(n)}) \]

So $L_{VT}(y \mid \theta^{(n)}) \le L_{VT}(y \mid \theta^{(n+1)})$ for every n = 0,1,...
Since $\{L_{VT}(y \mid \theta^{(n)})\}_n$ is a monotonically increasing sequence with an
upper bound, it converges as n goes to infinity.
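The alternation of steps (1) and (2) can be made concrete with a small Python sketch. The
model here is our own toy example, not one from the paper: each observation is the number
of heads in `n_tosses` tosses of one of two hidden coins, `pi` holds the mixing
probabilities, `q` the head probabilities, and `delta` is a pseudo count playing the role
of a Dirichlet prior. The recorded `history` is the objective up to an additive constant
(the binomial coefficients and the prior's normalizer are dropped).

```python
from math import log

def viterbi_training(data, n_tosses, pi, q, delta=1.0, max_iter=100):
    """Hard EM for a toy two-coin mixture; data[t] = heads in n_tosses tosses."""
    def log_joint(c, k):
        return log(pi[c]) + k * log(q[c]) + (n_tosses - k) * log(1 - q[c])

    def objective(z):
        log_prior = delta * sum(log(pi[c]) + log(q[c]) + log(1 - q[c])
                                for c in (0, 1))
        return sum(log_joint(c, k) for c, k in zip(z, data)) + log_prior

    z, history = None, []
    for _ in range(max_iter):
        # Step (1): most probable hidden state for each observation.
        new_z = [max((0, 1), key=lambda c: log_joint(c, k)) for k in data]
        history.append(objective(new_z))
        if new_z == z:               # hidden states stopped changing
            break
        z = new_z
        # Step (2): MAP re-estimation from hard counts plus pseudo counts.
        for c in (0, 1):
            ks = [k for zc, k in zip(z, data) if zc == c]
            pi[c] = (len(ks) + delta) / (len(data) + 2 * delta)
            q[c] = (sum(ks) + delta) / (n_tosses * len(ks) + 2 * delta)
    return z, pi, q, history

z, pi, q, history = viterbi_training([9, 8, 1, 2, 9, 0], 10,
                                     [0.5, 0.5], [0.6, 0.4])
```

The loop stops as soon as the hard assignment `z` repeats, mirroring the termination
condition $x^{(n+1)} = x^{(n)}$, and `history` is non-decreasing, as the convergence
argument predicts.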

3.2 VT for PRISM 

Here we reformulate VT in the context of PRISM. Let DB be a PRISM program with parameters
$\theta$ and $P_{DB}(\cdot \mid \theta)$ a probability measure defined by DB. Also let
$\mathbf{G} = G_1, \ldots, G_T$ be observed goals, and $\mathrm{expl}_0(G_t)$
($1 \le t \le T$) the set of all explanations $e_t$ for $G_t$ such that
$e_t, DB \vdash G_t$. $\mathbf{G} = G_1, \ldots, G_T$ corresponds to observed variables y
and $e_1, \ldots, e_T$ to hidden variables x in $p(x, y \mid \theta)$ respectively in
equations (1) and (2) in Subsection 3.1.

Let msw(i,·) be the set of msw atoms for a multi-valued random switch i as before that
represents a probabilistic choice from a finite set $V_i$^6 of possible outcomes such that
msw(i,v) ($v \in V_i$) becomes exclusively true with probability $\theta_{i,v}$. Since
$\sum_{v \in V_i} \theta_{i,v} = 1$ holds, $\theta_i$ is a point in the probability
simplex. We put $\theta_i = \{\theta_{i,v}\}_{v \in V_i}$ and $\theta = \{\theta_i\}_i$
where i ranges over possible switch names.

We introduce as a prior distribution the Dirichlet distribution
$P_{Dir}(\theta_i) \propto \prod_{v \in V_i} \theta_{i,v}^{\alpha_{i,v}-1}$ with
hyperparameters $\{\alpha_{i,v}\}_{v \in V_i}$ over $\theta_i$ and their product
distribution $P_{Dir}(\theta) \stackrel{\mathrm{def}}{=} \prod_i P_{Dir}(\theta_i)$. In the
following, to avoid the difficulty of zero-probability encountered in parameter learning,
we assume pseudo count $\delta_{i,v} \stackrel{\mathrm{def}}{=} \alpha_{i,v} - 1 > 0$ and
use $\delta_{i,v}$ in place of $\alpha_{i,v}$.

^6 In PRISM, $V_i$ is declared by the values/2-3 predicate.
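Finally recall that the Viterbi explanation $e^*_t$ for a goal $G_t$ is a most probable
explanation for $G_t$ given by

\[ e^*_t \stackrel{\mathrm{def}}{=} \mathrm{argmax}_{e_t \in \mathrm{expl}_0(G_t)}
   P_{DB}(e_t \mid \theta). \qquad (3) \]

By substituting $\mathbf{G} = G_1, \ldots, G_T$ for y and $e_1, \ldots, e_T$ for x in the
definition of $L_{VT}(y \mid \theta)$, the objective function
$L_{VT}(\mathbf{G} \mid \theta)$ for VT in PRISM is now computed as follows (the last line
holds up to an additive constant that does not depend on $\theta$).

\[ L_{VT}(\mathbf{G} \mid \theta)
   = \log \max_{e_1 \in \mathrm{expl}_0(G_1), \ldots, e_T \in \mathrm{expl}_0(G_T)}
     \left( \prod_{t=1}^{T} P_{DB}(e_t, G_t \mid \theta) \right) P_{Dir}(\theta) \]
\[ = \log \left( \prod_{t=1}^{T} \max_{e_t \in \mathrm{expl}_0(G_t)}
     P_{DB}(e_t \mid \theta) \right) P_{Dir}(\theta) \]
\[ = \log \left( \prod_{t=1}^{T} P_{DB}(e^*_t \mid \theta) \right) P_{Dir}(\theta) \]
\[ = \log \prod_{i,v} \theta_{i,v}^{\sum_{t=1}^{T} \sigma_{i,v}(e^*_t) + \delta_{i,v}}
     \qquad (4) \]

where i, v range over those such that msw(i,v) appears in some $e^*_t$ and
$\sigma_{i,v}(e^*_t)$ is the count of msw(i,v) in $e^*_t$.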

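Likewise by substituting $\mathbf{G} = G_1, \ldots, G_T$ for y and $e_1, \ldots, e_T$ for
x in equations (1) and (2) respectively and using the definition of
$L_{VT}(\mathbf{G} \mid \theta)$, we obtain the VT algorithm for PRISM which alternately
executes (5) and (6) where $\theta^{(n)}$ stands for the set of parameters
$\{\theta^{(n)}_{i,v}\}$ at step n.

\[ e^{*(n)}_t = \mathrm{argmax}_{e_t \in \mathrm{expl}_0(G_t)}
   P_{DB}(e_t \mid \theta^{(n)}) \quad (1 \le t \le T) \qquad (5) \]
\[ \theta^{(n+1)}_{i,v} \propto \sum_{t=1}^{T} \sigma_{i,v}(e^{*(n)}_t) + \delta_{i,v}
   \qquad (6) \]

Here (5) corresponds to (1) and (6) to (2) respectively.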
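As a brief check on update (6), the following derivation sketch maximizes an objective of
the form (4) over one switch at a time. The shorthand $c_{i,v}$ for the total count and the
Lagrange multiplier $\lambda_i$ are our notation, introduced only for this illustration.

```latex
% Shorthand (ours): c_{i,v} = \sum_{t=1}^{T} \sigma_{i,v}(e_t^{*(n)})
\begin{align*}
\text{maximize}\quad & \sum_{i}\sum_{v \in V_i} (c_{i,v} + \delta_{i,v}) \log\theta_{i,v}
  \quad\text{subject to}\quad \sum_{v \in V_i}\theta_{i,v} = 1 \ \text{for each } i.\\
\text{Stationarity:}\quad & \frac{c_{i,v}+\delta_{i,v}}{\theta_{i,v}} = \lambda_i
  \;\Longrightarrow\; \theta_{i,v} = \frac{c_{i,v}+\delta_{i,v}}{\lambda_i}.\\
\text{Normalization:}\quad & \lambda_i = \sum_{v' \in V_i} (c_{i,v'}+\delta_{i,v'})
  \;\Longrightarrow\;
  \theta^{(n+1)}_{i,v} = \frac{c_{i,v}+\delta_{i,v}}{\sum_{v' \in V_i}(c_{i,v'}+\delta_{i,v'})}.
\end{align*}
```

Since $\delta_{i,v} > 0$, every updated parameter is strictly positive, which is the
zero-probability safeguard mentioned when the pseudo counts were introduced.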

Using (5) and (6), VT in PRISM is performed as follows. Given observed goals
$\mathbf{G} = G_1, \ldots, G_T$, we first perform tabled search for all explanations to
build explanation graphs expl($G_t$) for each t ($1 \le t \le T$). Then starting from the
initial parameters $\theta^{(0)}$, we repeat (5) and (6) alternately while computing the
Viterbi explanations $e^{*(n)}_t$ in (5) by dynamic programming over expl($G_t$) as
explained in Section 2 until $e^{*(n+1)}_t = e^{*(n)}_t$ holds for all t
($1 \le t \le T$). $\{\theta^{(n+1)}_{i,v}\}$ are then the learned parameters.
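The procedure can be sketched in Python as below. This is a simplified illustration, not
PRISM's implementation: explanations are enumerated explicitly instead of being searched by
dynamic programming on explanation graphs, a single pseudo count `delta` is shared by all
switches, and the function name, switch "s" and data are hypothetical.

```python
from collections import Counter
from math import prod

def vt_prism_sketch(expls_per_goal, theta, values, delta=1.0, max_iter=100):
    """expls_per_goal[t]: explanations of goal G_t, each a tuple of (switch, value)."""
    prev = None
    for _ in range(max_iter):
        # (5): Viterbi explanation of each goal under the current parameters.
        best = [max(es, key=lambda e: prod(theta[a] for a in e))
                for es in expls_per_goal]
        if best == prev:             # Viterbi explanations stopped changing
            break
        prev = best
        # (6): theta_{i,v} proportional to (count in Viterbi explanations + delta).
        counts = Counter(a for e in best for a in e)
        for i, vs in values.items():
            norm = sum(counts[(i, v)] + delta for v in vs)
            for v in vs:
                theta[(i, v)] = (counts[(i, v)] + delta) / norm
    return best, theta

values = {"s": ["a", "b"]}                      # outcomes of hypothetical switch s
theta0 = {("s", "a"): 0.7, ("s", "b"): 0.3}     # initial parameters
goals = [
    [(("s", "a"), ("s", "a")), (("s", "b"),)],  # explanations for G_1
    [(("s", "a"),), (("s", "b"), ("s", "b"))],  # explanations for G_2
]
best, theta = vt_prism_sketch(goals, theta0, values)
```

In this run the Viterbi explanations are unchanged after the first update, so the loop
stops at the second iteration with the parameters from the first update.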

Having derived the VT algorithm for PRISM, we examine the effect of the termination
condition $e^{*(n+1)}_t = e^{*(n)}_t$ ($1 \le t \le T$) on the convergence of VT. As we
remarked in Section 1, this condition means VT terminates as soon as the Viterbi
explanations converge, i.e. there is no change of the Viterbi explanations between step n
and step n + 1, whereas EM always runs until convergence of parameters. As a result, since
a small change of parameters does not affect the Viterbi explanation but keeps EM running,
VT tends to converge in far fewer iterations than EM.

To empirically check this, we conducted parameter learning of probabilistic grammars by VT
and by EM using PRISM and compared their convergence behavior^7. We used two probabilistic
grammars, a PCFG and a PLCG (probabilistic left-corner grammar) for the ATR corpus
(Uratani et al. 1994) (their details are described in the next section), and measured the
average number of iterations and learning time^8 required for convergence over ten runs.
Table 1 summarizes the results with standard deviations in parentheses.



Table 1. Average number of iterations and learning time for convergence

          Iterations                     Learning time (sec)
          VT             EM              VT            EM
PCFG      8.10(2.28)     123.6(3.23)     0.45(0.11)    6.29(0.16)
PLCG      15.80(4.73)    144.2(43.51)    1.55(0.36)    11.686(3.64)



Looking at the table, we see that VT required only a small number of iterations to
converge compared to EM; the ratio of the average number of iterations of VT to EM is
1:15.2 w.r.t. the PCFG and 1:8.3 w.r.t. the PLCG. We also note that the ratio of average
learning time^9 is similar to that of iterations, 1:13.8 w.r.t. the PCFG and 1:7.4 w.r.t.
the PLCG respectively. It therefore seems natural to conclude that VT learns parameters
with far fewer iterations and thereby much faster than EM^10.

^7 All experiments in this paper are done on a single machine with Core i7 Quad
2.67GHz×2 CPU and 72GB RAM running OpenSUSE 11.2, using PRISM2.1.
^8 We used a built-in predicate prism_statistics(em_time,x) to measure learning time,
which returns in x the time used by the learning algorithm.
^9 Learning time displayed by the PRISM system after learn is "total learning time" which
includes search time for explanations and other overhead time such as copying msws in the
memory, in addition to actual learning time reported by prism_statistics(em_time,x). Since
such extra time accounts for a large percentage of total learning time, it can happen that
the difference in total learning time between EM and VT is smaller than in Table 1.
^10 In the table, the difference of VT and EM in the number of iterations is statistically
significant for both grammars by unpaired t-test at the 5% significance level with the
Bonferroni correction. This applies to learning time as well.

Since VT is a local maximizer, it is sensitive to the initial condition like EM. So we
need to carefully choose $\theta^{(0)}$. Uniform distributions for $\theta^{(0)}$
(Spitkovsky et al. 2010) and for $e^{(0)}_t$ ($1 \le t \le T$) (Cohen and Smith 2010) are
possible choices. In practice, we further add random restart to alleviate the sensitivity
problem. For example in the experiments in the next section, we repeated parameter
learning 50 times with random restart for each learning and selected the parameter set
giving the largest value of the objective function $L_{VT}(\mathbf{G} \mid \theta)$
computed by (4).

4 Learning experiments with probabilistic grammars 

In this section we apply VT to parsing tasks in natural language processing where
observable variables are sentences and hidden variables are parse trees. We predict parse
trees for given sentences using probabilistic grammars (PCFG and PLCG) whose parameters
are learned by VT and compare the parsing performance with each of EM, MAP and VB^11.

4.1 VT for PCFGs 

Prior to describing the parameter learning experiment with a PCFG by VT, we briefly review
how to write PCFGs in PRISM. In PCFGs, sentence derivation is carried out
probabilistically. When there are k PCFG rules
$\theta_1 : A \to \beta_1, \ldots, \theta_k : A \to \beta_k$ for a nonterminal A with
probabilities $\theta_1, \ldots, \theta_k$ ($\theta_1 + \cdots + \theta_k = 1$), A is
expanded by $A \to \beta_i$ into $\beta_i$ with probability $\theta_i$. The probability of
a parse tree $\tau$ is the product of probabilities associated with occurrences of CFG
rules in $\tau$ and the probability of a sentence is the sum of probabilities of parse
trees for the sentence.
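These two definitions can be illustrated with a small Python sketch for the toy PCFG
{ 0.4: S -> S S, 0.3: S -> a, 0.3: S -> b }. This is our own illustration rather than the
PRISM program: the function names are hypothetical, a tree is either a terminal string or a
pair (left, right) standing for S -> S S, and the sentence probability is computed by the
standard inside recursion.

```python
from functools import lru_cache

P_BINARY = 0.4                      # S -> S S
P_LEX = {"a": 0.3, "b": 0.3}        # S -> a, S -> b

def tree_prob(tree):
    """Product of rule probabilities over the rule occurrences in the tree."""
    if isinstance(tree, str):
        return P_LEX[tree]
    left, right = tree
    return P_BINARY * tree_prob(left) * tree_prob(right)

def sentence_prob(words):
    """Sum of the probabilities of all parse trees of the sentence."""
    words = tuple(words)

    @lru_cache(maxsize=None)
    def inside(i, j):
        # Probability that S derives words[i:j].
        if j - i == 1:
            return P_LEX.get(words[i], 0.0)
        return sum(P_BINARY * inside(i, k) * inside(k, j)
                   for k in range(i + 1, j))

    return inside(0, len(words))

p_tree = tree_prob((("a", "b"), "a"))        # one of the two trees for "a b a"
p_sentence = sentence_prob(["a", "b", "a"])  # both trees summed
```

The sentence "a b a" has two parse trees with equal probability, so `p_sentence` is twice
`p_tree`.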

Writing PCFG programs is easy in PRISM. Fig. 1 is a PRISM program for a PCFG
{ 0.4: S → S S, 0.3: S → a, 0.3: S → b }. In general, PCFG rules such as
{ $\theta_1 : A \to \beta_1, \ldots, \theta_k : A \to \beta_k$ } are encoded by a values/3
declaration as

\[ \texttt{values('A',}\,[\beta_1, \ldots, \beta_k],\,[\theta_1, \ldots, \theta_k]\texttt{)} \]

where $\beta_i$ ($1 \le i \le k$) is a Prolog list of terminals and nonterminals.

^11 We assume the reader is familiar with the basics of parsing theory.

We wrote a PCFG program as shown in Fig. 1 for the ATR corpus (Uratani et al. 1994) using
an associated CFG^12. The corpus contains labeled parse trees for 10,995 Japanese sentences
whose average length is about 10. The associated CFG^13 comprises 861 CFG rules (168
non-terminals and 446 terminals) and yields 958 parses/sentence on average. We applied
four learning algorithms, i.e. VT, EM, MAP and VB (Sato et al. 2009), available in
PRISM2.1 to the PCFG program for the ATR corpus^14 and compared the performance of VT with
other learning methods.

    values('S',[['S','S'],[a],[b]],[0.4,0.3,0.3]).

    pcfg(L) :- pcfg(['S'],L,[]).
    pcfg([A|R],L0,L2) :-
        ( get_values(A,_) ->     % msw(A,_) exists, so
            msw(A,RHS),          % A is a nonterminal
            pcfg(RHS,L0,L1)
        ; L0=[A|L1] ),
        pcfg(R,L1,L2).
    pcfg([],L,L).

Fig. 1. A PCFG program

We conducted eight-fold CV (cross validation) for each algorithm^15 to evaluate the
quality of learned parameters in terms of three performance metrics, i.e. LT (labeled
tree), BT (bracketed tree) and 0-CB (zero crossing brackets) (Goodman 1996). These metrics
are computed from $T_c$, the set of parse trees in a test corpus which are considered
correct, and $T_g$, the set of parse trees predicted for sentences in the test corpus by a
parsing algorithm. LT is defined as $|T_c \cap T_g|/N$ where $|S|$ denotes the number of
elements in a set S and $N = |T_g| = |T_c|$. It is the ratio of correctly predicted
labeled parse trees to the total number of labeled parse trees. Compared to LT, BT is a
less strict metric that ignores nonterminals in parse trees. Let $T'_g$ be the set of
unlabeled trees obtained by removing nonterminals from $T_g$ which coincide with the
corresponding unlabeled trees in $T_c$. Then BT is defined as $|T'_g|/N$. Finally 0-CB is
the least strict of the three metrics. We say brackets $(w_i, \ldots, w_j)$ in a tree
$\tau$ are inconsistent with another tree $\tau'$ if $\tau'$ contains brackets
$(w_s, \ldots, w_t)$ such that $s < i \le t < j$ or $i < s \le j < t$. Otherwise they are
consistent with $\tau'$. Let $T''_g$ be the set of trees in $T_g$ which have no
inconsistent brackets with the corresponding trees in $T_c$. Then 0-CB is given by
$|T''_g|/N$.
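The three metrics can be sketched in Python as follows. The representation is an
assumption made for this illustration only: each tree is a set of labeled brackets
(label, i, j) with inclusive word positions, so repeated identical brackets collapse, and
the toy trees are hypothetical.

```python
def crosses(b1, b2):
    """True if bracket b1 = (i, j) is inconsistent with bracket b2 = (s, t)."""
    (i, j), (s, t) = b1, b2
    return s < i <= t < j or i < s <= j < t

def parsing_metrics(gold, guessed):
    """Return (LT, BT, 0-CB) for paired lists of gold and predicted trees."""
    n = len(gold)
    lt = bt = cb0 = 0
    for tc, tg in zip(gold, guessed):
        uc = {(i, j) for _, i, j in tc}      # unlabeled gold brackets
        ug = {(i, j) for _, i, j in tg}      # unlabeled predicted brackets
        lt += int(tc == tg)                  # labeled trees coincide
        bt += int(uc == ug)                  # trees coincide ignoring labels
        cb0 += int(not any(crosses(a, b) for a in ug for b in uc))
    return lt / n, bt / n, cb0 / n

gold = [{("S", 1, 3), ("NP", 1, 2)}] * 3
guessed = [
    {("S", 1, 3), ("NP", 1, 2)},   # exact match
    {("S", 1, 3), ("VP", 1, 2)},   # same brackets, wrong label
    {("S", 1, 3), ("NP", 2, 3)},   # bracket (2,3) crosses gold bracket (1,2)
]
lt, bt, cb0 = parsing_metrics(gold, guessed)
```

Every tree counted by LT is also counted by BT, and every tree counted by BT is also
counted by 0-CB, which is the sense in which the metrics become progressively less strict.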

To perform cross validation, the entire corpus is partitioned into eight sections. In each
fold, one section is used as a test corpus and sentences in the remaining sections are
used as training data. For each of EM, MAP, VT and VB, parameters (or pseudo counts) are
learned from the training data. A parse tree is predicted, i.e. the Viterbi explanation is
computed, for each sentence in the test corpus using learned parameters or using the
approximate posterior distribution learned by VB. The predicted trees are compared to
answers, i.e. the labeled trees in the test corpus, to compute LT, BT and 0-CB
respectively. The final performance figures are calculated as averages over eight folds
and summarized in Table 2 with standard deviations in parentheses.
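The fold structure can be sketched as below. How the corpus was actually partitioned is not
stated in the text, so the interleaved split used here is an assumption, and the function
name and the integer stand-ins for sentences are hypothetical.

```python
def k_fold_splits(items, k=8):
    """Yield (training, test) pairs; each section serves once as the test corpus."""
    folds = [items[i::k] for i in range(k)]          # assumed interleaved split
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        yield train, test

corpus = list(range(20))                 # stand-in for the sentences of a corpus
splits = list(k_fold_splits(corpus, k=8))
```

A per-fold score, such as LT on the test section, would then be averaged over the eight
splits.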



^12 In the experiment, to speed up parsing, we partially evaluated the PCFG program with
individual CFG rules and used the resulting specialized program.
^13 Copyright protected.
^14 In PRISM, EM is a special case of MAP inference. We used random but almost uniform
initialization of parameters and uniformly set pseudo counts $\delta_{i,v}$ to
$1.0 \times 10^{-6}$ for EM and 1.0 for MAP and VT, respectively. Similarly we uniformly
set hyperparameters $\alpha_{i,v}$ to 1.0 for VB. The number of candidates for re-ranking
in VB (Sato et al. 2009) was set to 5. In all cases, we set the number of random restarts
to 50 and used the best parameter set that gave the largest value of objective functions,
i.e. $L_{EM}$ for EM, $L_{MAP}$ for MAP and $L_{VT}$ for VT. For the case of VB that learns
pseudo counts, we chose the best set of pseudo counts giving the highest free energy
(Sato et al. 2009).
^15 We chose eight-fold CV for parallel execution of learning by our machine.
Table 2. Parsing performance by PCFG

             Learning method
Metric       VT             EM             MAP            VB
LT(%)        74.69(0.87)    70.02(0.88)    70.31(1.13)    72.13(1.10)
BT(%)        77.87(0.84)    73.10(1.01)    73.45(1.20)    75.46(1.13)
0-CB(%)      83.78(0.92)    84.44(0.89)    84.89(0.84)    87.08(0.87)



We statistically analyzed the parsing performance by Dunnett's test^16. The result is that
VT outperformed all of EM, MAP and VB in terms of LT and BT at the 5% level of
significance but did not do so in terms of 0-CB. This is understandable if we assume that
there are many parse trees that can give high scores in terms of less restrictive metrics
such as 0-CB but since VT concentrates probability mass on a single tree, those promising
trees are allocated little probability mass by VT, which results in relatively low
performance of VT in terms of 0-CB.

So far we examined parsing performance by parameters obtained from incomplete data
(sentences in the corpus). We also examined parsing performance using 8-fold CV by
parameters learned from complete data, i.e. by parameters obtained by counting occurrences
of CFG rules in the corpus. The result is LT: 79.06%(1.25), BT: 85.28%(0.69),
0-CB: 95.37%(0.26) (figures in parentheses are standard deviations). These figures are
considered as the best possible performance. We notice that the gap in parsing performance
between the complete data case and the incomplete data case tends to become wider as the
performance metric gets less restrictive in the order of LT, BT and 0-CB.

Another thing to note is that the objective functions for EM, MAP and VB are
similar in the sense that they all sum out hidden variables whereas the objective
function for VT retains them. This fact together with Table 2 seems to suggest that
parsing performance is affected more by the difference among objective functions
than by the difference among learning methods.
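
The contrast between the two kinds of objective can be made concrete on a toy example. The following Python sketch is illustrative only and is not part of PRISM; the table `joint` and the function names are assumptions introduced here. For one fixed observation y it evaluates a likelihood-style objective, which sums the hidden variable out, and a VT-style objective, which keeps only the best hidden state.

```python
import math

# Hypothetical joint probabilities p(x, y | theta) for one fixed observation y
# and three candidate hidden states x (for example, three parse trees of y).
joint = {"tree1": 0.020, "tree2": 0.015, "tree3": 0.005}

def em_objective(joint_probs):
    """Log-likelihood of y: the hidden variable is summed out."""
    return math.log(sum(joint_probs.values()))

def vt_objective(joint_probs):
    """VT-style objective for y: the hidden variable is retained and maximized over."""
    return math.log(max(joint_probs.values()))

def viterbi_state(joint_probs):
    """The hidden state that attains the maximum."""
    return max(joint_probs, key=joint_probs.get)
```

Because the maximum never exceeds the sum, `vt_objective(joint)` is a lower bound on `em_objective(joint)`; the two coincide only when all probability mass sits on a single hidden state.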

4.2 VT for PLCGs 

PCFGs assume top-down parsing. In contrast, there is a class of probabilistic
grammars based on bottom-up parsing for CFGs called PLCGs (probabilistic left-
corner grammars) (Manning 1997; Roark and Johnson 1999; Van Uytsel et al. 2001).
They use the same set of CFG rules as PCFGs but attach probabilities
not to expansion of nonterminals but to three elementary operations in bottom-up

We used Dunnett's test for multiple comparisons of means with VT as the control to avoid
inflating the significance level. Figures in bold face indicate best performance.






values(lc('S','S'),[rule('S',['S','S'])]).
values(lc('S',a),[rule('S',[a])]).
values(lc('S',b),[rule('S',[b])]).
values(first('S'),[a,b]).
values(att('S'),[att,pro]).

plcg(L):- g_call(['S'],L,[]).

g_call([],L,L).
g_call([G|R],[Wd|L],L2):-
   ( G = Wd -> L1 = L                        % shift operation
   ; msw(first(G),Wd),lc_call(G,Wd,L,L1) ),
   g_call(R,L1,L2).

lc_call(G,B,L,L2):-                          % B-tree is completed
   msw(lc(G,B),rule(A,[B|RHS2])),
   ( G = A -> true ; values(lc(G,A),_) ),
   g_call(RHS2,L,L1),                        % complete A-tree
   ( G = A -> att_or_pro(A,Op),
      ( Op = att -> L2 = L1 ; lc_call(G,A,L1,L2) )
   ; lc_call(G,A,L1,L2) ).

att_or_pro(A,Op):-
   ( values(lc(A,A),_) -> msw(att(A),Op) ; Op=att ).



Fig. 2. A PLCG program 



parsing, i.e. shift, attach and project. As a result they define a different class of dis- 
tributions from PCFGs. In this subsection we conduct an experiment for parameter 
learning of a PLCG by VT. 

The objective of this subsection is twofold. One is to apply VT to a PLCG,
which, as far as we know, has not been attempted before, and to examine the parsing
performance. The other is to empirically demonstrate the universality of our approach
to VT, which subsumes differences in probabilistic models as differences in
explanation graphs and applies a single VT algorithm to the latter.

Programs for PLCGs look very different from those for PCFGs. Fig. 2 is a PLCG
program which is a dual version of the PCFG program in Fig. 1 with the same
underlying CFG {S → S S, S → a, S → b}. It generates sentences using the first set
of 'S' and the left-corner relation for this CFG (values/2 there only declares
the space of outcomes). The program works as follows. Suppose nonterminals G
and B are in the left-corner relation and G is waiting for a B-tree, i.e. a subtree
with the root node labeled B, to be completed. When a B-tree is completed, the
program probabilistically chooses a CFG rule of the form A → B β to further grow
the B-tree using this rule. Upon the completion of the A-tree, and if G = A, the
attach operation or the projection is probabilistically chosen. By replacing values
declarations appropriately, this program is applicable to any PLCG.
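
The values declarations in Fig. 2 are determined by the first set and the left-corner relation of the underlying CFG. As a rough illustration of how they could be derived for another grammar, the following Python sketch computes both for the CFG above. It is not part of PRISM; the function names are hypothetical, and the sketch assumes a CFG without empty right-hand sides.

```python
# CFG underlying Fig. 1 and Fig. 2: S -> S S | a | b.
RULES = [("S", ("S", "S")), ("S", ("a",)), ("S", ("b",))]

def left_corner_pairs(rules):
    """Transitive closure of {(A, B) : A -> B beta is a rule}."""
    direct = {(lhs, rhs[0]) for lhs, rhs in rules if rhs}
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in direct:
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

def first_set(goal, rules):
    """Terminals in the left-corner relation with `goal` (cf. values(first(G),...))."""
    nonterminals = {lhs for lhs, _ in rules}
    return sorted(b for a, b in left_corner_pairs(rules)
                  if a == goal and b not in nonterminals)

def lc_outcomes(goal, corner, rules):
    """Rules A -> corner beta usable while `goal` waits (cf. values(lc(G,B),...))."""
    pairs = left_corner_pairs(rules)
    return [(lhs, rhs) for lhs, rhs in rules
            if rhs and rhs[0] == corner and (lhs == goal or (goal, lhs) in pairs)]
```

For this grammar `first_set("S", RULES)` gives the outcomes a and b declared for first('S'), and `lc_outcomes("S", "a", RULES)` gives the single rule S → a declared for lc('S',a).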

We developed a PLCG program for the ATR corpus, similarly to the PCFG program,
and applied VT, EM, MAP and VB to it to learn parameters. We
measured parsing performance by learned parameters in terms of LT, BT and 0-CB
by eight-fold CV for each of VT, EM, MAP and VB and obtained Table 3
Table 3. Parsing performance by PLCG

                      Learning method
Metric      VT             EM             MAP            VB
LT(%)       76.26(0.96)    71.81(0.91)    71.17(0.93)    71.15(0.90)
BT(%)       78.86(0.70)    75.17(1.15)    74.28(1.12)    74.28(1.00)
0-CB(%)     87.45(1.00)    86.49(0.97)    86.03(0.67)    86.04(0.71)


(standard deviations in parentheses). We compared the parsing performance of VT
with that of EM, MAP and VB by Dunnett's test at the 5% level of significance, similarly
to the PCFG case. This time, however, VT outperformed all of EM, MAP and VB
by all metrics, i.e. LT, BT and 0-CB.

5 Applying VT to classification tasks 

In the previous section, we conducted learning experiments with a PCFG and a
PLCG in which the prediction target was parse trees that coincide with a hidden
variable in a probabilistic model. In this section, we deal with a different situation
where the prediction target differs from the hidden variable. We apply VT to classification
tasks using an NBH (naive Bayes with a hidden variable) model whose hidden
variable is summed out and instead an observable variable, the class label, is predicted
for the given data.




Fig. 3. A Bayesian network for NBH model 

Before explaining classification tasks, we review NBH for completeness (Sato 2011).
NBH is an extension of NB (naive Bayes) with a hidden class variable HC as illustrated
in Fig. 3. It defines a joint distribution

$$p(A_1,\ldots,A_n,HC,C \mid \theta) = \prod_{j=1}^{n} p(A_j \mid HC,C,\theta)\; p(HC \mid C,\theta)\; p(C \mid \theta)$$
values(class,[democrat,republican]).    % class labels are democrat or republican
values(attr(_A,_C,_HC),[y,n]).          % attribute values are y or n

nbayes(C,Vals):-
   msw(class,C),msw(hclass(C),HC),nbh(1,C,HC,Vals).
nbh(J,C,HC,[V|Vals]):-
   ( V == '?' -> msw(attr(J,C,HC),_)    % '?' indicates missing value
   ; msw(attr(J,C,HC),V) ),
   J1 is J+1,
   nbh(J1,C,HC,Vals).
nbh(_,_,_,[]).



Fig. 4. An NBH program 

where $\theta$ denotes the model parameters, the $A_j$'s attributes of observed data, $C$ a class
and $HC$ a hidden class. It is easily seen from equation (7) below that NBH
represents the data distribution in a class $C$ as a mixture of data distributions
indexed by $HC$.

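$$P(A_1,\ldots,A_n \mid C,\theta) = \sum_{HC} P(A_1,\ldots,A_n \mid HC,C,\theta)\, P(HC \mid C,\theta)$$
$$\phantom{P(A_1,\ldots,A_n \mid C,\theta)} = \sum_{HC} \prod_{j=1}^{n} P(A_j \mid HC,C,\theta)\, P(HC \mid C,\theta) \qquad (7)$$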

The role of $HC$ is to cluster data in a class $C$ so that the distribution $P(A_1,\ldots,A_n \mid
HC,C,\theta)$ in each cluster $HC$ satisfies, as much as possible, the independence condition
$P(A_1,\ldots,A_n \mid HC,C,\theta) = \prod_{j=1}^{n} P(A_j \mid HC,C,\theta)$ imposed on NB. NBH
was introduced in (Sato 2011) as a simple substitute for more complicated variants
of NB such as TAN (Friedman et al. 1997), AODE (Webb et al. 2005), BNC
(Castillo and Gama 2005), FBC (Su and Zhang 2006) and HBN (Jiang et al. 2009).
Given data $A_1,\ldots,A_n$, we classify $A_1,\ldots,A_n$ as a class $C^*$ by

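$$C^* = \operatorname*{argmax}_{C} P(C \mid A_1,\ldots,A_n,\theta)$$
$$\phantom{C^*} = \operatorname*{argmax}_{C} \sum_{HC} P(C \mid HC,A_1,\ldots,A_n,\theta)\, P(HC \mid A_1,\ldots,A_n,\theta). \qquad (8)$$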

Note here that the hidden variable, $HC$, is not a prediction target unlike in probabilistic
grammars. It is just summed out. However we expect that a sub-classifier $P(C \mid
HC,A_1,\ldots,A_n,\theta)$ indexed by $HC$ performs better than $P(C \mid A_1,\ldots,A_n,\theta)$, the
original NB, in each cluster and so does their mixture (see equation (8)).
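
The classification rule can be sketched outside PRISM as follows. This Python fragment is only an illustration under stated assumptions: the dictionary layout of the parameters and the function names are hypothetical, and the posterior is obtained by summing the joint distribution over $HC$ and normalizing, which yields the same argmax as equation (8). Missing values marked '?' are skipped, which amounts to marginalizing the attribute out.

```python
def nbh_posterior(attrs, p_class, p_hclass, p_attr):
    """Return P(C | A_1..A_n) for an NBH model by summing out HC.

    p_class[c]           : P(C = c)
    p_hclass[c][h]       : P(HC = h | C = c)
    p_attr[(j, c, h)][v] : P(A_j = v | HC = h, C = c), attributes numbered from 1
    """
    scores = {}
    for c, pc in p_class.items():
        total = 0.0
        for h, ph in p_hclass[c].items():
            prod = pc * ph
            for j, v in enumerate(attrs, start=1):
                if v != "?":                 # missing value: marginalized out
                    prod *= p_attr[(j, c, h)][v]
            total += prod                    # sum over the hidden class HC
        scores[c] = total
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def classify(attrs, p_class, p_hclass, p_attr):
    """Return the class with the highest posterior probability."""
    post = nbh_posterior(attrs, p_class, p_hclass, p_attr)
    return max(post, key=post.get)

# Toy parameters (made up for illustration): one attribute, two hidden classes.
P_CLASS = {"democrat": 0.5, "republican": 0.5}
P_HCLASS = {"democrat": {1: 0.5, 2: 0.5}, "republican": {1: 0.5, 2: 0.5}}
P_ATTR = {
    (1, "democrat", 1): {"y": 0.9, "n": 0.1},
    (1, "democrat", 2): {"y": 0.7, "n": 0.3},
    (1, "republican", 1): {"y": 0.2, "n": 0.8},
    (1, "republican", 2): {"y": 0.4, "n": 0.6},
}
```

With these toy numbers, `classify(["y"], P_CLASS, P_HCLASS, P_ATTR)` selects democrat and `classify(["n"], P_CLASS, P_HCLASS, P_ATTR)` selects republican.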

To evaluate the quality of parameters learned by VT for classification tasks, we
conducted a learning experiment with NBH using ten data sets from the UCI Machine
Learning Repository (Frank and Asuncion 2010). In the experiment, training



We interchangeably use the attributes $A_1,\ldots,A_n$ as data when the context is clear.
data is given as a set of tuples $C, A_1,\ldots,A_n$ consisting of a class $C$ and attributes
$A_1,\ldots,A_n$. Parameters (or pseudo counts) are learned by the PRISM program shown
in Fig. 4, which is also used for predicting class labels in test data. A values/2
declaration values(class,[democrat,republican]) in the program tells PRISM
to introduce two msw atoms, msw(class,democrat) and msw(class,republican),
that represent a probabilistic choice between democrat and republican as a class,
implicitly together with their parameters $\theta_{class,democrat}$ and $\theta_{class,republican}$ such
that $\theta_{class,democrat} + \theta_{class,republican} = 1$. This program assumes that attributes are
numbered and missing values in a data set are replaced with '?'.

We obtained the classification accuracy of NBH for each combination of data set,
learning method (VT, EM, MAP, VB), the number of clusters #HC in a class $C$
(from 2 to 15) and hyper parameters ({0.1, 1.0} as $\alpha_{i,v}$ for VB, and the same as
pseudo counts $\delta_{i,v}$ for VT and MAP) as the average over ten times ten-fold CV, except for the
nursery, mushroom and kr-vs-kp data sets, in which case ten-fold CV was used. We
similarly obtained the classification accuracy of NB as a baseline.
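
For readers unfamiliar with the protocol, ten times ten-fold CV can be sketched as below. This Python fragment is a generic illustration, not the procedure actually used in the experiments; the function name, the seed and the absence of stratification are all assumptions made here.

```python
import random

def repeated_kfold(n_items, k=10, repeats=10, seed=0):
    """Yield (train_indices, test_indices) for `repeats` rounds of k-fold CV.

    Each round reshuffles the data, so the reported accuracy is the average
    over repeats * k train/test splits.
    """
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_items))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]          # k nearly equal folds
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test
```

Setting `repeats=1` gives the plain ten-fold CV used for the three largest data sets.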

Table 4 summarizes classification accuracies of NB and NBH. Accuracy for NBH
in the table is the best accuracy obtained by varying #HC and hyper parameters
as mentioned above for the given learning method and data set. Figures in bold face
indicate the best accuracy achieved in each data set. The table shows that for most
data sets NBH performed better than NB as we expected. Actually the difference
in accuracy between NB and the best one for NBH is statistically significant by
unpaired t-test at the 5% level with the Bonferroni correction for all data sets except


This program is for the vote data set from the repository.

We used ten times ten-fold CV when possible to obtain robust estimates, though it is
computationally expensive (Japkowicz and Shah 2011).

We used the EM algorithm for parameter learning of NB as there are missing data in some
data sets.

As ten data sets are used, the significance level is set to 5%/10 = 0.5%.



Table 4. Accuracy by VT, EM, MAP and VB

                          NB       NBH : Learning method
Data set         Size     EM(%)    VT(%)     EM(%)     MAP(%)    VB(%)
nursery          12960    90.23     92.93     99.40     99.65    97.45
mushroom          8124    99.57    100.00    100.00    100.00    99.99
kr-vs-kp          3196    87.86     88.69     91.59     92.34    88.90
car               1728    85.86     90.97     97.67     97.82    94.68
votes              435    90.29     96.00     95.66     96.51    96.05
dermatology        336    97.73     97.98     97.51     98.06    98.17
glass              214    72.82     75.86     76.84     76.66    76.53
iris               150    94.40     95.07     95.13     95.07    95.07
breast-cancer      150    72.52     72.52     70.07     72.76    72.83
zoo                101    95.07     96.55     97.42     96.95    96.62

dermatology, iris and breast-cancer. The superiority of NBH over NB demonstrated
in this experiment is interpreted as an effect of clustering in a class by introducing
a hidden variable HC.

Comparing the classification accuracies of the four parameter learning methods applied
to NBH, we notice that VT's performance is comparable to that of the other three,
i.e. EM, MAP and VB, except for the nursery, kr-vs-kp and car data sets.
For these data sets VT's accuracy is worse than the best one achieved by one of
the three learning methods, which is statistically confirmed by unpaired t-test at
the 5% level of significance with the Bonferroni correction. So from the viewpoint
of a learning experiment with NBH, we cannot say, regrettably, that VT outperformed
EM, MAP and VB for all data sets. However, the result is understandable if we
recall that while the prediction target in the experiment is a class variable C, VT
optimizes parameters not for C but for the hidden variable HC, which is summed
out and hence only indirectly affects prediction.



6 Removing the exclusiveness condition 

PRISM assumes the exclusiveness condition on programs to simplify probability
computation, as explained in Section 2. It means we cannot write a program clause
$H \Leftarrow B \lor B'$ unless $P_{DB}(B \land B' \mid \theta) = 0$ is guaranteed (Sato and Kameya 2001).
Although most generative probabilistic models such as BNs, HMMs and PCFGs
are naturally described as PRISM programs satisfying the condition, removing it
certainly gives us more freedom of probabilistic modeling. Theoretically it is possible
to remove it by introducing BDDs (binary decision diagrams) as ProbLog
(De Raedt et al. 2007; Kimmig et al. 2008) and PITA (Riguzzi and Swift 2011) do,
and their related systems, LeProbLog (Gutmann et al. 2008), LFI-ProbLog (Gutmann et al. 2011)
and EMBLEM (Bellodi and Riguzzi 2012), offer parameter learning based on probability
computation by BDDs, though with different learning frameworks from
PRISM. If, however, we are only interested in obtaining the Viterbi explanation
after parameter learning, as we are in many cases, VT gives us a way of doing it
without BDDs even for programs that do not satisfy the exclusiveness condition.
This is because VT does not require the exclusiveness condition to execute equations
(5) and (6), which always deal with a single explanation and a single probability.

We next give an example of parameter learning by VT followed by the computation
of the Viterbi explanation for a program that violates the exclusiveness
condition. Fig. 5 is a PRISM program translated from a ProbLog program that
computes a path between two nodes (and its probability) in a graph. The graph
has six nodes. Edges are assigned probabilities and we express this fact by attaching
an msw atom to an atom d_e(x,y) representing an edge x - y in the program.
For example the (directed) edge d_e(1,2) between node 1 and node 2 is assigned
probability 0.9 as indicated by msw(d_e(1,2),on) following its value declaration
values(d_e(1,2),[on,off],[0.9,0.1]) in the program.



The program is taken from the tutorial at http://dtai.cs.kuleuven.be/problog/
values(d_e(1,2),[on,off],[0.9,0.1]).
values(d_e(3,4),[on,off],[0.6,0.4]).
values(d_e(2,6),[on,off],[0.5,0.5]).
values(d_e(5,3),[on,off],[0.7,0.3]).
values(d_e(2,3),[on,off],[0.8,0.2]).
values(d_e(1,6),[on,off],[0.7,0.3]).
values(d_e(6,5),[on,off],[0.4,0.6]).
values(d_e(5,4),[on,off],[0.2,0.8]).

d_e(1,2):- msw(d_e(1,2),on).
d_e(3,4):- msw(d_e(3,4),on).
d_e(2,6):- msw(d_e(2,6),on).
d_e(5,3):- msw(d_e(5,3),on).
d_e(2,3):- msw(d_e(2,3),on).
d_e(1,6):- msw(d_e(1,6),on).
d_e(6,5):- msw(d_e(6,5),on).
d_e(5,4):- msw(d_e(5,4),on).

path(X,Y):- path(X,Y,[X]).
path(X,X,_).
path(X,Y,A):- X\==Y, (d_e(X,Z) ; d_e(Z,X)), absent(Z,A), path(Z,Y,[Z|A]).
absent(_,[]).
absent(X,[Y|Z]):- X\==Y, absent(X,Z).



Fig. 5. A graph program violating the exclusiveness condition 



Observe that a ground top-goal path(X,Y) causes a call to d_e(X,Z) with X 
ground and Z free that calls more than one clause, which leads to the violation 
of the exclusiveness condition. Nonetheless we can learn parameters by VT and 
compute the Viterbi explanation for this program. 

Fig. 6 is a sample session doing this. In Fig. 6 we first compute the Viterbi path
VE, i.e. the most probable path between node 1 and node 4, and its probability P by
applying the built-in predicate viterbif/3 to the goal path(1,4). We next renew
parameters by learning them using VT from the observed goals path(1,4), path(1,3), ...
Finally we compute the Viterbi path again, now determined by the learned parameters,
and see whether the learning changes the Viterbi path or not. In the session,
the Viterbi path changed after learning from 1 -> 2 -> 3 -> 4 to 1 -> 6 -> 5
-> 4 together with its probability from 0.432 to 0.161.

viterbif(path(1,4),P,_X) returns the Viterbi explanation in _X and its probability
in P. viterbi_switches(_X,VE) extracts the Viterbi path VE as a conjunction of msw atoms.
set_prism_flag(learn_mode,ml_vt) tells the PRISM system to use VT when learn/1 is invoked.



?- viterbif(path(1,4),P,_X),viterbi_switches(_X,VE).
P = 0.432
VE = [msw(d_e(1,2),on),msw(d_e(2,3),on),msw(d_e(3,4),on)]
?- set_prism_flag(learn_mode,ml_vt).
?- learn([path(1,4),path(1,3),path(2,4),path(2,5),path(3,6)]).
?- viterbif(path(1,4),P,_X),viterbi_switches(_X,VE).
P = 0.104
VE = [msw(d_e(1,6),on),msw(d_e(6,5),on),msw(d_e(5,4),on)]



Fig. 6. A sample session
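
The initial Viterbi probability, and the reason why summing over explanations is not sound for this program, can be checked independently of PRISM. The following Python sketch is an illustration written for this purpose, not PRISM code; it enumerates the acyclic paths between node 1 and node 4 using the initial edge probabilities of Fig. 5, treating edges as traversable in both directions as the path/3 clause does. The names `EDGE_PROB`, `neighbours` and `simple_paths` are hypothetical.

```python
# Initial edge probabilities taken from the values declarations in Fig. 5.
EDGE_PROB = {(1, 2): 0.9, (3, 4): 0.6, (2, 6): 0.5, (5, 3): 0.7,
             (2, 3): 0.8, (1, 6): 0.7, (6, 5): 0.4, (5, 4): 0.2}

def neighbours(x):
    """Nodes reachable from x over one edge, in either direction."""
    for (u, v), p in EDGE_PROB.items():
        if u == x:
            yield v, p
        elif v == x:
            yield u, p

def simple_paths(x, y, visited=None):
    """Yield (path, probability) for every acyclic path from x to y."""
    visited = visited or [x]
    if x == y:
        yield list(visited), 1.0
        return
    for z, p in neighbours(x):
        if z not in visited:
            for path, q in simple_paths(z, y, visited + [z]):
                yield path, p * q

explanations = list(simple_paths(1, 4))
viterbi_path, viterbi_prob = max(explanations, key=lambda e: e[1])
naive_sum = sum(prob for _, prob in explanations)
```

Here `viterbi_path` is 1 -> 2 -> 3 -> 4 with probability 0.9 × 0.8 × 0.6 = 0.432, in agreement with the first query of the session. In contrast, `naive_sum` adds up the probabilities of all eight explanations and exceeds 1, because the explanations share edges and are not mutually exclusive; computing the Viterbi explanation needs no such sum.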

VT thus enables us to learn parameters from programs that violate the exclusiveness
condition. However we have to recall at this point that VT has an objective
function different from likelihood and is biased (Lember and Koloydenko 2007),
and that the effect of removing the exclusiveness condition on the quality of parameters
estimated by VT is unknown at the moment, which remains a future research
topic.



7 Related work and discussion 

VT is closely related to K-means (MacQueen 1967), which is a standard clustering
method for continuous data. If we apply VT to a Gaussian mixture for clustering of
continuous data with the assumption of a common variance for all component Gaussian
distributions, the resulting algorithm is identical to K-means. In this sense,
the usefulness of VT is established. Actually VT has been used in various settings
(Brown et al. 1993; Juang and Rabiner 1990; Strom et al. 1999; Joshi et al. 2006;
Spitkovsky et al. 2010; Lomsadze et al. 2005) and also in the SRL frameworks that
deal with structured data (Singla and Domingos 2005; Huynh and Mooney 2010),
where the algorithmic essence of VT, coordinate ascent on parameters and target
variables with the argmax operation applied to the latter, is used.
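
The correspondence with K-means can be seen in a few lines. The following Python sketch is an illustration only; it assumes one-dimensional data and, in addition to the common variance mentioned above, equal mixing weights, so that the most probable component for a point is simply the one with the nearest mean. The function name `kmeans_1d` is hypothetical.

```python
def kmeans_1d(data, means, max_iter=100):
    """Hard EM for a 1-D Gaussian mixture with shared variance and equal weights.

    Assignment step: the component maximizing the Gaussian density of x is the
                     one with the nearest mean (argmax over the hidden variable).
    Update step:     the maximum-likelihood mean of a component is the average
                     of the points assigned to it.
    """
    means = list(means)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(len(means)), key=lambda k: (x - means[k]) ** 2)
                          for x in data]
        if new_assignment == assignment:   # hidden states unchanged: converged
            break
        assignment = new_assignment
        for k in range(len(means)):
            members = [x for x, a in zip(data, assignment) if a == k]
            if members:                    # keep the old mean for an empty cluster
                means[k] = sum(members) / len(members)
    return means, assignment
```

The loop stops as soon as the assignment of hidden states no longer changes, which is the same kind of termination test that lets VT finish in few iterations.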

Despite its popularity, however, it seems that VT so far has been model-specific
and only model-specific VT algorithms have been implemented. In this paper we
gave a unified treatment of VT for discrete models, for the first time to our knowledge,
and derived the VT algorithm for PRISM, which is a single generic algorithm
applicable to any discrete model as long as the model is described by a PRISM
program. Since our derivation of VT is based on the reduction of goals to AND-OR
propositional formulas, it seems quite possible for other logic-based modeling languages
that use BDDs such as ProbLog (De Raedt et al. 2007; Kimmig et al. 2008)
and PITA (Riguzzi and Swift 2011) to introduce VT as a parameter learning routine.

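One of the unique features of VT is its affinity with discriminative modeling.
Write VT's objective function $L_{VT}(y \mid \theta)$ as follows.

$$L_{VT}(y \mid \theta) = \log p(x^*, y \mid \theta)\, p(\theta)$$
$$x^* = \operatorname*{argmax}_{x}\; p(x, y \mid \theta)\, p(\theta) = \operatorname*{argmax}_{x}\; p(x \mid y, \theta).$$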

This means that although PRISM is intended for generative modeling, VT in
PRISM computes the Viterbi explanation $x^*$ that gives the highest conditional
probability $p(x^* \mid y, \theta)$ for $y$, whose form is identical to the objective function in
discriminative modeling, and the Viterbi explanation is chosen in the same way as
in discriminative modeling, provided the hidden variable is a prediction target.
When this condition is met VT shows good performance as demonstrated by the
experiments in Section 4, but if not, VT does not necessarily outperform other
parameter learning methods as exemplified in Section 5. It therefore seems reasonable
to say that VT is effective for prediction tasks when the prediction target coincides
with hidden variables in a probabilistic model, though we obviously need more
experiments.
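
The coordinate ascent behind $L_{VT}$ can be sketched on a small discrete model. The following Python fragment is an illustration under stated assumptions and is not the generic algorithm running on explanation graphs in PRISM: the model is a mixture of two coins, each observed sequence is generated by one hidden coin, and a pseudo count is added to every outcome as a simple form of smoothing. All names are hypothetical.

```python
import math

def vt_two_coins(sequences, theta, pseudo=1.0, max_iter=100):
    """Hard EM for a mixture of two coins.

    sequences : list of strings over {'H', 'T'}
    theta     : {'mix': P(coin 0), 'heads': [P(H | coin 0), P(H | coin 1)]}
    pseudo    : pseudo count added to every outcome (assumed smoothing scheme)
    """
    mix, heads = theta["mix"], list(theta["heads"])
    assignment = None
    for _ in range(max_iter):
        # Step 1: x* = argmax_x p(x, y | theta) for every observation y.
        new_assignment = []
        for seq in sequences:
            h, t = seq.count("H"), seq.count("T")
            scores = [math.log(pk) + h * math.log(heads[k]) + t * math.log(1.0 - heads[k])
                      for k, pk in enumerate((mix, 1.0 - mix))]
            new_assignment.append(0 if scores[0] >= scores[1] else 1)
        if new_assignment == assignment:   # explanations unchanged: terminate
            break
        assignment = new_assignment
        # Step 2: re-estimate theta by counting outcomes in the chosen explanations.
        n0 = assignment.count(0)
        mix = (n0 + pseudo) / (len(sequences) + 2 * pseudo)
        for k in (0, 1):
            hk = sum(s.count("H") for s, a in zip(sequences, assignment) if a == k)
            tk = sum(s.count("T") for s, a in zip(sequences, assignment) if a == k)
            heads[k] = (hk + pseudo) / (hk + tk + 2 * pseudo)
    return {"mix": mix, "heads": heads}, assignment
```

When the hidden coin itself is what we want to predict, the assignment returned by the loop is already the prediction, which is the situation in which the text above expects VT to do well.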

As a coordinate ascent local hill-climber, VT is sensitive to the initial parameters
and also sensitive to the Viterbi explanation. To mitigate the sensitivity problem
with initial parameters, we used 50 random restarts in the learning experiments
in Section 4. To cope with the sensitivity to the Viterbi explanation, it is interesting
to introduce k-best explanations as discussed in (Gutmann et al. 2008) and replace
the Viterbi explanation in VT with them. This approach will give us control over
the sensitivity and computation time by choosing k and seems not very difficult to
implement in PRISM as k-best explanations for a goal G are already computed by
built-in predicates such as n_viterbi(k,G).
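
One possible reading of this idea is sketched below in Python. It is an assumption-laden illustration, not a description of PRISM or of the method in (Gutmann et al. 2008): explanations are given as (switches, probability) pairs, the k most probable ones are kept, and switch counts are weighted by the probabilities renormalized over those k. The function names are hypothetical.

```python
import heapq

def k_best(explanations, k):
    """Return the k most probable (switches, probability) pairs."""
    return heapq.nlargest(k, explanations, key=lambda e: e[1])

def k_best_counts(explanations, k):
    """Switch counts weighted by probabilities renormalized over the k-best."""
    top = k_best(explanations, k)
    z = sum(p for _, p in top)
    counts = {}
    for switches, p in top:
        for sw in switches:
            counts[sw] = counts.get(sw, 0.0) + p / z
    return counts
```

With k = 1 every switch in the Viterbi explanation receives count 1, so the sketch reduces to the counting step of VT; larger k moves it toward the expected counts used by EM.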

Since VT in PRISM runs on explanation graphs obtained from all solution search,
it requires time for all solution search (by tabling) and also space to store discovered
explanation graphs. It is possible, however, to implement VT without explanation
graphs, and to realize a much more memory-saving VT by repeating search for a
Viterbi explanation in each cycle of VT. We note that this approach fits particularly well
with mode-directed tabling (Zhou et al. 2010). In mode-directed tabling, we can
search for partial Viterbi explanations for subgoals efficiently without constructing
explanation graphs and put them together to form a larger Viterbi explanation for
the goal. Currently, however, mode-directed tabling is not available in PRISM. We
are planning to incorporate it in PRISM in the near future.



8 Conclusion 

We introduced VT (Viterbi training) to PRISM to enhance PRISM's probabilistic
modeling power. To our knowledge, PRISM becomes the first SRL (statistical relational
learning) language (Getoor and Taskar 2007; De Raedt and Kersting 2008) in which VT is
available for parameter learning.

Although VT has already been used in various models under various names
(Brown et al. 1993; Juang and Rabiner 1990; Strom et al. 1999; Joshi et al. 2006;
Spitkovsky et al. 2010; Lomsadze et al. 2005), we made the following contributions
to VT. One is a generalization by deriving a generic VT algorithm for PRISM,
thereby making it uniformly applicable to a very wide class of discrete models
described by PRISM programs ranging from BNs to probabilistic grammars. The
other is an empirical clarification of conditions under which VT performs well. We
conducted learning experiments with a PCFG and a PLCG using VT and confirmed
VT's excellent parsing performance compared to EM, MAP and VB. We
also conducted a learning experiment with NBH for classification tasks. Putting
the results of these experiments together, we may say that VT performs well when
hidden variables are a prediction target.

From the viewpoint of PRISM, VT improves PRISM first by realizing faster convergence
compared to EM, second by providing the user with a parameter learning
method that can learn parameters good for prediction, and third by providing a
solution to the problem of the exclusiveness condition that hinders PRISM programming.
Thanks to VT, we are now able to use arbitrary programs with inclusive-or
for probabilistic modeling.

Last but not least we can say that as VT in PRISM is general and applicable to 
any PRISM program, it largely reduces the need for the user to develop a specific VT 
algorithm for a specific model. Furthermore since VT in PRISM can be used just by 
setting a PRISM flag appropriately, it makes VT easily accessible to (probabilistic) 
logic programmers. 